Competition for Popularity in Bipartite Networks 
We present a dynamical model for rewiring and attachment in bipartite networks in which edges
are added between nodes that belong to catalogs that can either be fixed in size or growing in size. 
The model is motivated by an empirical study of data from the video rental service Netflix, which 
invites its users to give ratings to the videos available in its catalog. We find that the distribution 
of the number of ratings given by users and that of the number of ratings received by videos both 
follow a power law with an exponential cutoff. We also examine the activity patterns of Netflix 
users and find bursts of intense video-rating activity followed by long periods of inactivity. We 
derive ordinary differential equations to model the acquisition of edges by the nodes over time and 
obtain the corresponding time-dependent degree distributions. We then compare our results with 
the Netflix data and find good agreement. We conclude with a discussion of how catalog models can 
be used to study systems in which agents are forced to choose, rate, or prioritize their interactions 
from a very large set of options. 
Human dynamics, which is concerned with the
characterization of human activity in time, has
been the subject of intense and exciting research
over the last few years. In one typical problem
setting, individuals are endowed with limited
resources, and there are numerous activities, be- 
haviors, and/or products that compete against 
each other for those resources. Although such sit- 
uations admit a natural formulation using bipar- 
tite (two-mode) networks that connect individu- 
als to activities, human dynamics has surprisingly 
seldom been studied from this perspective. In the 
present paper, we analyze bipartite networks con- 
structed from a large data set of video ratings by 
the users of a video rental company over a pe- 
riod of six years. To analyze the time evolution 
of these networks, we introduce the concept of 
a catalog network, and we use this approach to 
explore the driving forces behind the video rat- 
ing behavior of individuals. We believe that such 
a framework can be used to study many other 
phenomena in human dynamics that involve the 
allocation of and competition for scarce resources. 
I. INTRODUCTION

Numerous natural and man-made systems involve in- 
teractions between large numbers of entities. The struc- 
tural configuration of interactions is typically rather com- 
plicated, so the study of such systems often benefits 
greatly from network representations. A network
is usually abstracted mathematically as a graph
whose nodes represent the entities and whose edges
represent the interactions between the entities. In many
cases, edges can be weighted or directed, and more com- 
plicated frameworks such as hypergraphs can also be em- 
ployed. The number of edges connected to a node in an 
unweighted network is known as its degree, and the de- 
gree distribution of a network is given by the collection of 
numbers that give the fraction of nodes that have degree 
k (for all values of k). In weighted networks, one considers
the weight of an edge rather than simply whether
or not it exists. 

Because networked systems are not static, the last 
decade has witnessed a particular interest in models that 
attempt to address their growth and evolution. Perhaps
the best-known model of network growth was formulated
by Barabási and Albert [1]. Similar models
were also constructed decades earlier by Simon [10] and
Price [11]. Barabási and Albert examined networks arising
from diverse settings and found that their degree
distributions often seemed to follow power laws, which are
functions of the form $f(x) \sim x^{-a}$ (with $a > 0$). They
proposed a growth mechanism, which they called preferential
attachment (Price had called it cumulative advantage),
to try to explain their observations. One starts
with a small seed network and, in the simplest form of
the mechanism, iteratively adds individual nodes that
each possess exactly one edge. One connects each new
node to an existing one chosen at random with probability
proportional to its degree. That is, the probability to
choose node $m_i$ with degree $k_i$ is

$$ P(m_i) = \frac{k_i}{\sum_{j=1}^{N} k_j} , $$

where the total number of nodes N indicates the size
of the network. Because nodes with higher degrees have
correspondingly higher probabilities to receive new edges,
the preferential attachment growth mechanism leads naturally
to a power-law degree distribution [12].

FIG. 1: (Color online) A bipartite network with nodes in the
partite sets U = {1, 2, 3, 4} and M = {A, B, C}. Each edge
connects a number to a letter.
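A minimal sketch of this growth rule may help make it concrete. The code below is only an illustration: the seed network (two nodes joined by one edge) and the function name `preferential_attachment` are assumptions, not details taken from Refs. cited above. It relies on the fact that drawing uniformly from a list of edge endpoints is equivalent to drawing a node with probability proportional to its degree.

```python
import random


def preferential_attachment(n_nodes, seed=0):
    """Grow a network to n_nodes nodes; each new node brings exactly one edge,
    attached to an existing node chosen with probability k_i / sum_j k_j."""
    rng = random.Random(seed)
    degree = [1, 1]      # assumed seed network: nodes 0 and 1 joined by one edge
    endpoints = [0, 1]   # node i appears once in this list per incident edge
    for new in range(2, n_nodes):
        # A uniform draw over edge endpoints is a degree-proportional draw over nodes.
        target = rng.choice(endpoints)
        degree[target] += 1
        degree.append(1)
        endpoints += [target, new]
    return degree


degrees = preferential_attachment(1000)
```

Sorting `degrees` and plotting the fraction of nodes with degree at least k on logarithmic axes is one simple way to inspect the heavy tail that this rule produces.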

Because of ideas like preferential attachment and the 
resulting insights on the origin of heavy-tailed degree dis- 
tributions that one sees, e.g., in the World Wide Web or 
scientific collaboration networks, the study of networks 
has grown immensely during the last ten years.
However, most of this research has concentrated on one- 
mode (unipartite) networks, in which all of the nodes 
are of the same type. It is perhaps under-appreciated 
that other graph structures are also very important in 
many applications [13]. Even the simplest generalization,
known as a two-mode or bipartite network, has been
studied much more sparingly than unipartite networks. 
Bipartite networks contain two categories (partite sets) 
of nodes: $\mathcal{U} = \{u_1, u_2, \ldots, u_U\}$ (with $U$ members) and
$\mathcal{M} = \{m_1, m_2, \ldots, m_M\}$ (with $M$ members). As shown
in Fig. 1, each (undirected) edge connects a node in $\mathcal{U}$ to
one in $\mathcal{M}$. Bipartite networks abound in applications:
They can represent affiliation networks in which people
are connected to organizations or committees [14], ecological
networks with links between cooperating species
in an ecosystem [15], and more [16-20].

A bipartite network possesses a degree distribution for 
each of the two node types. We denote the adjacency 
matrix of a weighted bipartite network by $G \in \mathbb{R}^{U \times M}$.
Each matrix element $G_{ij}$ has a nonzero value if and only
if an edge exists between nodes $u_i$ and $m_j$. We denote the
matrices that result from the two unipartite projections
by $G_U = G G^T$ and $G_M = G^T G$.

The degree of a node in a unipartite projection network 
is then the number of nodes of the same type with which 
the node shares at least one neighbor in the original 
bipartite network. The node strengths similarly incor- 
porate connection strengths from the original bipartite 
network. (Recall that the "strength" of a node is the 
sum of the strengths of the edges connected to it.) For 
example, in an unweighted affiliation network, the two 
projections give the weighted connection strength (the 
number of common affiliations) among individuals and 
the interlock (the number of common people) among
organizations.
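The projection step can be illustrated with a short sketch. It assumes the standard one-mode projections, in which the user projection is the product of the bipartite adjacency matrix with its transpose and the video projection is the reverse product; the small affiliation matrix and the helper names `project` and `degree_and_strength` are hypothetical.

```python
def project(G):
    """Return G G^T for a bipartite adjacency matrix G given as a list of rows."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in G] for row_i in G]


def degree_and_strength(P, i):
    """Degree and strength of node i in a projection P, ignoring the diagonal."""
    off_diagonal = [w for j, w in enumerate(P[i]) if j != i]
    return sum(1 for w in off_diagonal if w > 0), sum(off_diagonal)


# Hypothetical unweighted affiliation network: 4 people (rows), 3 organizations (columns).
G = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]

G_U = project(G)                # people x people: number of common affiliations
G_M = project(list(zip(*G)))    # organizations x organizations: interlock

person_0 = degree_and_strength(G_U, 0)
```

In this toy example, person 0 shares two organizations with person 1 and one with person 3, so the degree in the projection is 2 while the strength is 3.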

Many of the real-life systems that can be represented 
by bipartite networks are dynamic, as the existence and 
connectivity of both nodes and edges can change in time. 
For example, a person might retire or leave one orga- 
nization to join another. One of the simplest types of 
changes is edge rewiring, in which one end of an edge is 
fixed to a node and the other end moves from one node 
to another (such as in the aforementioned change of affil- 
iation). Because of the important insights they can offer, 
network rewiring models have received increasing attention
[22-26]. They are closely related to abstract
urn models from probability theory [27-29], models of
language competition [30], and models of transmission of
cultural artifacts [31]. More generally, they can help lead
to a better understanding of any system in which the nature
or existence of an interaction among agents changes
over time.

The rest of our presentation is organized as follows. In 
Section II, we analyze a large data set of time-stamped
video ratings from the video rental service Netflix that
we model as a bipartite network of people and videos. In
Section III, we examine the bursty behavior of individual
users. In Section IV, we develop a catalog model of
bipartite network growth and evolution. We then study
the Netflix data using this model in Section V. Finally,
we discuss our results and present directions for future
research in Section VI.



II. NETFLIX VIDEO RATINGS 

Netflix is an online video rental service that encour- 
ages its users to rate the videos they rent in order to 
improve their personalized recommendations. As part of 
the Netflix Prize competition [32], in which the company
challenged the public to improve its video recommendation
algorithm, Netflix released a large, anonymized
collection of user-assigned ratings of videos in its catalog. 
In this paper, we use the Netflix data to study human 
dynamics in the form of video ratings from a limited cat- 
alog. One can also examine the dynamics of the ratings 
themselves, which would complement a recent empirical 
study of video ratings that used data from the Internet 
Movie Database (IMDB) [33]. The Netflix data consist of
100,480,507 ratings of 17,770 videos. The ratings, which
were given by 480,189 Netflix users between October 1998
and December 2005, were sampled uniformly at random
by Netflix from the set of users who had rated at least 20
videos [34]. Each entry in the data includes the video ID,
user ID, rating score (an integer from 1 to 5), and submission
date. To illustrate some of the temporal dynamics
in the data, we show in Fig. 2 the total number of ratings
for each day from July to August 2003. The number of
daily ratings exhibits a weekly pattern in which Mondays
and Tuesdays have the highest activity and Saturdays
and Sundays have the lowest. This reflects the weekly
patterns in human work-leisure habits.

FIG. 2: Number of daily ratings for each day in July and
August 2003. The mean number of ratings per day over this
period is 30,449. The dashed vertical lines indicate Tuesdays.

Figure 3 shows the total number of ratings from 2000
to the end of 2005. The number of ratings seems to grow
exponentially, which we confirm by fitting the data to the
function

$$ r(t) = a_r \left( e^{b_r t} - 1 \right) \qquad (1) $$

using nonlinear least squares. We obtain the parameter
values $a_r \approx 6.3656 \times 10^5$ and $b_r \approx 0.0024$.

The number of users also grows exponentially, as shown
in the top panel of Fig. 4. The dashed curve in the plot
is the fit to

$$ u(t) = a_u \left( e^{b_u t} - 1 \right) , \qquad (2) $$

where we obtain $a_u \approx 1.0018 \times 10^4$ and $b_u \approx 0.0018$. We
will need to take the exponential growth of the system 
into account when comparing data from dates that are 
far apart from each other. 

In the bottom panel of Fig. 4, we show the number of
videos from 2000 to 2005. The number of videos appears
to grow roughly linearly as a function of time, but in fact
it is better described by the relation

$$ m(t) = a_m + b_m t^{c_m} , \qquad (3) $$

where fitting yields $a_m \approx 2780.00$, $b_m \approx 0.6705$, and
$c_m \approx 1.3097$.

FIG. 3: (Color online) Number of ratings in the Netflix data
versus time from the beginning of 2000 to the end of 2005.
Circles indicate data from Netflix and the dashed red curve
is a fit to equation (1).

FIG. 4: (Color online) Number of users (top) and videos (bottom)
in the Netflix data versus time from the beginning of
2000 to the end of 2005. Circles indicate data from Netflix
and the dashed red curves are fits to equations (2) and (3) for
users and videos, respectively.
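The three fitted growth laws are easy to evaluate numerically. The sketch below uses the parameter values quoted in the text; the unit of t is not stated there, so measuring t in days since the beginning of 2000 is an assumption. Under that assumption the curves evaluated at the end of 2005 are of the same order as the reported totals (100,480,507 ratings, 480,189 users, and 17,770 videos). The function names are hypothetical.

```python
import math

# Fitted parameters quoted in the text for the ratings, users, and videos.
A_R, B_R = 6.3656e5, 0.0024
A_U, B_U = 1.0018e4, 0.0018
A_M, B_M, C_M = 2780.00, 0.6705, 1.3097


def ratings(t):
    """Exponential growth law for the cumulative number of ratings."""
    return A_R * (math.exp(B_R * t) - 1.0)


def users(t):
    """Exponential growth law for the number of users."""
    return A_U * (math.exp(B_U * t) - 1.0)


def videos(t):
    """Power-law growth (plus offset) for the number of videos."""
    return A_M + B_M * t ** C_M


t_end = 6 * 365  # assumed unit: days elapsed between the start of 2000 and the end of 2005
summary = (ratings(t_end), users(t_end), videos(t_end))
```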



A. Bipartite Network Formulation 

The Netflix data can be represented as a bipartite network.
The two types of nodes in this network are users
and videos. We use $\mathcal{U}$ to denote the set of users and $\mathcal{M}$
to denote the set of videos. We ignore the rating values
and consider only the presence or absence of a rating
event, which corresponds to an edge between a user and
a video in the unweighted bipartite network. The large
size and longitudinal nature of the data provide a valuable
opportunity to study video rating in the context of
human dynamics, as has been done previously with mobile
telephone networks [1, 35], book sale rankings [36],
and electronic and postal mail usage patterns [1, 37, 38].

                  a                     b
            Mean      Var         Mean      Var
Videos     0.6580    0.0200      0.0686    0.0100
Users      0.8381    0.0573      0.0116    0.0007

TABLE I: Fitting parameters of the daily video and user degree
distributions from 2000 to 2005 for the power law with
exponential cutoff in equation (4).



B. Degree Distributions 
The bipartite video-rating network has one degree distribution
for the user nodes and another one for the video
nodes. Keeping in mind the observations in Fig. 2, we examine
the cumulative degree distributions of individual
days. The distributions have a similar functional form
for each day in the data set. We fit them to a power law
with an exponential cutoff,

$$ F(k) \sim k^{-a} e^{-bk} , \qquad (4) $$

using a modification of the method discussed by Clauset
et al. [39]. As an example, we show in Fig. 5 the cumulative
degree distributions for one day. Table I gives the
parameter values that we found in our fits of the data to
equation (4). Despite the weekly pattern of the ratings
shown in Fig. 2, we did not find any significant differences
between the values of a and b for different days of
the week. Hence, although the number of daily ratings
does differ significantly among weekdays, such differences
seem to have little effect on the aggregate structure
of the network.

FIG. 5: (Color online) Cumulative degree distributions of user
(top) and video (bottom) nodes for August 26, 2003 (a Tuesday).
The dashed curves are the fits to equation (4) with parameters
$a \approx 0.9828$, $b \approx 0.0057$ for the users and $a \approx 0.6622$,
$b \approx 0.0070$ for the videos.
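For readers who want to reproduce this kind of plot, the following sketch computes an empirical cumulative degree distribution and evaluates the cutoff power-law form. It interprets "cumulative" as the fraction of nodes with degree at least k, which is an assumption, and it does not implement the fitting procedure of Clauset et al.; the function names are hypothetical.

```python
import math
from collections import Counter


def ccdf(degrees):
    """Return a sorted list of (k, fraction of nodes with degree >= k)."""
    n = len(degrees)
    counts = Counter(degrees)
    points, remaining = [], n
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


def cutoff_power_law(k, a, b):
    """Unnormalized power law with exponential cutoff, k^(-a) * exp(-b k)."""
    return k ** (-a) * math.exp(-b * k)


example = ccdf([1, 1, 2, 5])
```

Comparing `ccdf` of a day's degrees with `cutoff_power_law` (up to a normalization constant) on logarithmic axes gives a quick visual check of a candidate pair (a, b).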

The problem setting offers some insight into the observed
functional form of the degree distribution. Users
select which videos to rate from a large set of possibili- 
ties and possess time limitations on the number of videos 
that they are able to watch and rate. As in any market, 
videos must compete against each other for users' atten- 
tion. One can also anticipate that certain videos saturate 
their market, especially in the case of niche videos whose 
audience is small to begin with. Once the demand for a 
niche video has been met, it virtually ceases to receive 
further ratings. On the other hand, blockbusters might 
continue receiving numerous ratings for a long period of 
time. 



C. Clustering coefficients 

To investigate the local connectivity of nodes and examine
the impact of highly connected nodes, we calculate
bipartite clustering coefficients [16, 40]. In bipartite
networks, a clustering coefficient for a node can be calculated
by counting the number of cycles of length 4 (i.e.,
the number of "squares") that include the node and dividing
the result by the total possible number of squares
that could include the node. As stated by Zhang et al.
in [16], the possible (or underlying) number of squares is
calculated by adding the potential links (including existing
ones) between a particular node and the neighbors of
its neighbors. In Fig. 6, we show how a square occurs in
a bipartite network when two neighbors of a node have
another neighbor in common. Bipartite networks cannot
have triangles (three mutually connected nodes) because
two nodes of the same type cannot be neighbors, so a
square is the shortest possible cycle.

The definition of a clustering coefficient of node $m_i$ in
an unweighted bipartite network is [16]:

$$ C_4(m_i) = \frac{\sum_{j,h} q_{ijh}}{\sum_{j,h} \left[ (k_j - \eta_{ijh}) + (k_h - \eta_{ijh}) + q_{ijh} \right]} , \qquad (5) $$

where $q_{ijh}$ is the observed number of squares containing
$m_i$ and any two neighbors $u_h$ and $u_j$. The degrees of the
neighbors are $k_h$ and $k_j$, respectively, and $\eta_{ijh} = q_{ijh} + 1$.
The possible number of squares is calculated by adding the
degrees of the nodes $u_h$ and $u_j$ minus the link that each
shares with $m_i$. If the three nodes are not part of a square,
nothing more is needed to avoid double-counting. If the three
nodes are part of a square, then the square represented by the
deleted link must be added again, hence
$(k_j - \eta_{ijh}) + (k_h - \eta_{ijh}) + q_{ijh}$
in the denominator of equation (5).

FIG. 6: (Color online) Examples of how to calculate clustering
coefficients for bipartite (top) and unipartite (bottom)
networks. In the bipartite network, solid lines indicate edges
that form the square that includes node B, whose bipartite
clustering coefficient calculated according to equation (5) is
$C_4 = 1/5$. One obtains this result because there are five
possible squares for this node ({1A2B, 1C2B, 1A4B, 1C4B,
2C4B}) but only one of them (2C4B) actually exists. In the
unipartite network, the solid lines indicate edges that form
the triangles that include node 1. If this were an unweighted
network, for which $G_{ij} \in \{0, 1\}$ for all i and j, then one would
obtain an unweighted clustering coefficient of $C_3(1) = 2/3$. To
calculate the value of its weighted clustering coefficient $\tilde{C}_3$, we
use equation (6).
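The square-counting rule can be checked on a small example. The sketch below follows the definition of Zhang et al. as described in the text: for each pair of neighbors it counts the observed squares and divides the total by the number of possible squares. The edge list is a hypothetical graph chosen to be consistent with the example quoted in the caption of Fig. 6 (five possible squares for node B, one of which exists); the edges of node 3 are not specified there, so the single edge given to it is an assumption that does not affect the result for B.

```python
from itertools import combinations


def bipartite_clustering(adj, node):
    """Square clustering coefficient C_4 of `node` in an unweighted bipartite graph.

    adj maps each node to the set of its neighbors (which are of the other type).
    """
    observed = 0
    possible = 0
    for j, h in combinations(sorted(adj[node]), 2):
        q = len((adj[j] & adj[h]) - {node})   # squares through node, j, and h
        eta = 1 + q
        observed += q
        possible += (len(adj[j]) - eta) + (len(adj[h]) - eta) + q
    return observed / possible if possible else 0.0


# Hypothetical edge list consistent with the Fig. 6 caption (node 3's edge is assumed).
edges = [(1, "A"), (1, "B"), (2, "B"), (2, "C"), (4, "B"), (4, "C"), (3, "A")]
adj = {}
for user, video in edges:
    adj.setdefault(user, set()).add(video)
    adj.setdefault(video, set()).add(user)

c4_B = bipartite_clustering(adj, "B")
```

For node B the neighbor pairs (1, 2) and (1, 4) each contribute two possible squares and none observed, while the pair (2, 4) contributes one possible and one observed square, which gives 1/5.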

In Fig. 7, we show the values of $C_4(m_i)$ for the video
and user nodes for a single day (Tuesday, August 12,
2003). In Table II, we show the mean values of the bipartite
clustering coefficient of all one-day snapshots of
Netflix in 2003. In spite of the weekday-dependent variation
in the number of daily ratings, the values of the
bipartite clustering coefficient do not vary significantly
across weekdays. However, the values of $\langle C_4 \rangle$ increase
by almost 80% for both node types on weekends. For a
network constructed from a single day's data, only about
2% of the possible squares typically exist; this is comparable
to what would occur in a random network with
the same degree distributions. To investigate whether
the presence of blockbuster nodes (which have high degrees
and considerably increase the number of possible
squares) has any effect on the value of $\langle C_4 \rangle$, we calculated
the clustering coefficient after removing the ten
most-rated videos. We did not find any conclusive evidence
of blockbusters driving the value of the clustering
coefficient; some of them caused the value of $\langle C_4 \rangle$ to go
down and others caused it to go up.

One can also examine clustering coefficients in the
weighted unipartite networks given by the projected adjacency
matrices $G_U$ and $G_M$. We calculate the weighted
clustering coefficient for each projection using the formula [41]

$$ \tilde{C}_3(m_i) = \frac{2}{k_i (k_i - 1)} \frac{1}{\hat{G}_M} \sum_{j,h} \left( G_{ij} G_{ih} G_{hj} \right)^{1/3} , \qquad (6) $$

where $k_i$ is again the degree of node $m_i$, $G_{ij}$ is the weight
of the edge between $m_i$ and $m_j$, $\hat{G}_M = \max(G_{ij})$
denotes the maximum edge weight in the network, and the sum
runs over pairs of neighbors of $m_i$ so that each triangle is
counted once. The geometric mean $(G_{ij} G_{ih} G_{hj})^{1/3}$ of the
edge weights gives the "intensity" of the (i, j, h)-triangle. When
the network is unweighted, $(G_{ij} G_{ih} G_{hj})^{1/3}$ is 1 if and only
if all edges in the (i, j, h)-triangle exist and 0 if they do not,
reducing the equation to the unweighted unipartite clustering
coefficient

$$ C_3(m_i) = \frac{2 t_i}{k_i (k_i - 1)} , \qquad (7) $$

where $t_i$ is the number of triangles that include node $m_i$.

FIG. 7: (Color online) Bipartite clustering coefficients $C_4(m_i)$
for video (blue) and user nodes (inset, green) for August
12, 2003 (a Tuesday). The mean values for this day are
$\langle C_4 \rangle = \frac{1}{M} \sum_{i=1}^{M} C_4(m_i) \approx 0.02606$ for the videos and
$\langle C_4 \rangle = \frac{1}{U} \sum_{i=1}^{U} C_4(u_i) \approx 0.03144$ for the users.

            $\langle C_4 \rangle$              $\langle \tilde{C}_3 \rangle$
            mean      var         mean      var
Videos     0.02039   0.0007      0.0056    $10^{-6}$
Users      0.02092   0.0012      0.0044    $10^{-6}$

TABLE II: Means and variances of $\langle C_4 \rangle$ (for the bipartite
network) and $\langle \tilde{C}_3 \rangle$ (for the projections) of videos and users on
single-day snapshots of 2003, calculated using equations (5)
and (6).
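The relation between the weighted and unweighted coefficients can be illustrated with a short sketch. It assumes the intensity-based form of the weighted coefficient, namely a prefactor of 2 / (k (k - 1)), weights normalized by the largest off-diagonal weight, and each triangle counted once; the example matrix and the function name `weighted_clustering` are hypothetical. With all weights equal, the value reduces to the unweighted coefficient 2t / (k (k - 1)).

```python
from itertools import combinations


def weighted_clustering(W, i):
    """Intensity-based weighted clustering coefficient of node i.

    W is a symmetric weight matrix (list of lists); diagonal entries are ignored.
    """
    n = len(W)
    w_max = max(W[a][b] for a in range(n) for b in range(n) if a != b)
    neighbors = [j for j in range(n) if j != i and W[i][j] > 0]
    k = len(neighbors)
    if k < 2 or w_max == 0:
        return 0.0
    # Geometric mean of the three edge weights; it is 0 when the edge j-h is absent.
    total = sum((W[i][j] * W[i][h] * W[j][h]) ** (1.0 / 3.0)
                for j, h in combinations(neighbors, 2))
    return 2.0 * total / (w_max * k * (k - 1))


# Hypothetical unweighted network: node 0 has three neighbors and sits in two triangles.
W = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [1, 0, 1, 0]]

c3_node0 = weighted_clustering(W, 0)
```

Here node 0 has degree 3 and belongs to 2 triangles, so the result is 2 * 2 / (3 * 2) = 2/3.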

In Fig. 8, we show the $\tilde{C}_3(u_i)$ values for the user projection
$G_U$ (with 10,228 nodes and 814,667 edges) from
Monday, August 4, 2003. In Table II, we show the mean
clustering-coefficient values for the projected user and
video networks for all single-day snapshots of 2003. The
values of $\langle \tilde{C}_3 \rangle$ did not vary much among weekdays, except
that the videos' $\langle \tilde{C}_3 \rangle$ almost doubled on
the weekends, from an average of 0.0045 from Monday
to Friday to 0.0086 on Saturday and Sunday.

Given the values of $\langle C_4 \rangle$ in Table II, it is unsurprising
that the values of $\langle \tilde{C}_3 \rangle$ are also typically low. In
the inset of Fig. 8, we show the values of the users'
unweighted clustering coefficient $C_3$, which are naturally
much higher. For example, about 4000 users have
$C_3 = 1.0$, indicating that all potential triangles exist
among these users. This differentiates one set of nodes
from the rest. This feature, which we observe often in the
data, arises from the dominant video of the day. For August
4, 2003, this video (which is typically a blockbuster)
was Daredevil, which had 396 ratings and created many
edges in the user projection among the users who rated it.
Removing Daredevil from the bipartite network also removes
these deviant nodes. This feature is not apparent
if one calculates only the unweighted unipartite clustering
coefficient $C_3$. Just as we did with $\langle C_4 \rangle$, and given
the dramatic effect observed by removing Daredevil, we
calculated $\langle \tilde{C}_3 \rangle$ for the projected network of users after
removing the ten most-rated videos. We found that for every
additional video removed, the value of $\langle \tilde{C}_3 \rangle$ increased by
0.2%, while for $\langle C_3 \rangle$ the increment was slightly larger.

FIG. 8: Weighted clustering coefficient $\tilde{C}_3(u_i)$ for nodes in the
unipartite projection onto users for August 4, 2003. The x-axis
represents node degrees, and the y-axis represents $\tilde{C}_3(u_i)$.
The mean values for this day are $\langle \tilde{C}_3 \rangle = \frac{1}{U} \sum_{i=1}^{U} \tilde{C}_3(u_i) \approx 0.0013$
for the projection onto users and $\langle \tilde{C}_3 \rangle \approx 0.0086$ for the
projection onto videos (not shown). The inset shows values
of the unweighted coefficient $C_3(u_i)$ from the same data.

III. USER BURSTS

FIG. 9: (Color online) Cumulative distribution of the inter-event
time between the ratings of one Netflix user. The user
signed up on April 4, 2000, and has a degree of 940 based on
ratings cast over a period of almost five years. The dashed
curve indicates the fit to the function $F(x) \sim x^{-a}$, which
yields $a \approx 2.27$ in this case. The inset shows the number of
days between consecutive video ratings.


A close examination of the rating habits of individual
users can also yield rich and informative insights. Recent
research has shown that people tend to have bursts of e-mail
and postal correspondence, in which they send and
receive numerous messages within short periods of time,
followed by long periods of inactivity [1, 37, 38]. We find
similar features in the Netflix data, as about 70% of the
users exhibit bursty behavior by rating several videos in
one go after several days of no activity. We illustrate this
phenomenon in Fig. 9 by plotting the cumulative distribution
of inter-event times between the ratings of one
user over a period of almost five years. We fit this distribution
to a power law $F(x) \sim x^{-a}$ using the method discussed
in Ref. [39] to determine the value of the exponent
a. We can similarly provide estimates for possible power
laws (with actual power laws over roughly two decades of
data) among the other bursty users, though the value of
a depends on the final degree (i.e., the total number of
rated videos) of the user. For example, bursty users with
final degrees between 100 and 1000 have a mean exponent
of $a \approx 2.54$, whereas those with final degrees of at
least 4000 have a mean exponent of $a \approx 3.17$. Additionally,
there are several types of users among those who do
not exhibit bursty dynamics. In particular, some users
rated only a very small number of videos (which may be
due to the sampling done by Netflix) and others exhibit
seemingly unrealistic levels of rating activity. (For example,
there are 47 users who signed up in January 2004 or
later and who have rated more than 4000 videos each.)
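As a rough illustration of the quantities involved, the sketch below computes inter-event times (in days) from a list of rating dates and then estimates the exponent of a cumulative power law. The estimator is the continuous Hill-type maximum-likelihood formula, which is a simplification and not the fitting procedure of Ref. [39] used in the text; the lower cutoff `x_min`, the sample dates, and the function names are assumptions.

```python
import math
from datetime import date


def inter_event_days(dates):
    """Days between consecutive rating events, after sorting the dates."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def ccdf_exponent(gaps, x_min=1):
    """Hill-type estimate of a in F(x) ~ x^(-a), using only gaps >= x_min.

    Raises ZeroDivisionError if every retained gap equals x_min.
    """
    tail = [x for x in gaps if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return len(tail) / log_sum


# Hypothetical rating dates for one user.
sample = [date(2000, 4, 16), date(2000, 4, 4), date(2000, 4, 6), date(2000, 4, 4)]
gaps = inter_event_days(sample)
```

Several ratings on the same day give gaps of zero, which the estimator discards because they fall below `x_min`.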
IV. CATALOG NETWORKS

The above empirical investigation of the Netflix data 
motivates the development of an evolution model for bi- 
partite catalog networks, which arise in a diverse set of 
applications. Such networks have two sets of nodes whose 
numbers can be fixed or dynamic, and edges are placed 
one at a time between previously unconnected nodes that
are chosen according to predefined rules. One contin- 
ues to add edges until a predefined final time has been 
reached or the system has become saturated, at which 
point every node in one partite set is connected to ev- 
ery node in the other partite set. The Netflix network 
can be studied using such a catalog network framework; 
it starts completely disconnected (nobody has rated any 
videos), and the users start choosing and rating videos 
from the catalog. Depending on the way the data set is 
sampled, the catalogs can be static (e.g. a one-day snap- 
shot) or dynamic (e.g. the full data set). Catalog models 
of network evolution are closely related to the network 
rewiring problem studied by Plato and Evans [18, 19] that
features fixed sets of artifacts and individuals. Every in- 
dividual has one affiliation (a connection) with an artifact 
and can reassign this connection to another node as the 
network evolves. In contrast, in a catalog network, any 
edge that has been placed between two nodes in the net- 
work is permanent. Consequently, catalog networks are 
suited to describing records of interactions that are as- 
signed dynamically and then remain permanently in the 
system. 

As before, $\mathcal{U}$ denotes the set of users and $\mathcal{M}$ denotes
the set of videos. The size of $\mathcal{U}$ is $u(\tau)$ and the size of $\mathcal{M}$
is $m(\tau)$, where $\tau$ denotes a discrete time that is indexed
by the ratings. That is, we take every rating event as a
time step, so when we discuss time in this context, we are
referring to "rating time" and not physical time unless we
indicate otherwise. Because $m(\tau)$ and $u(\tau)$ are not always
integers, we define $U(\tau) = \lfloor u(\tau) \rfloor$ and $M(\tau) = \lfloor m(\tau) \rfloor$
as the (nonnegative integer) numbers of user and video
nodes, respectively. The associated time-dependent catalog
vectors, $D_U$ and $D_M$, have components given by the
degrees of each node in the catalog:

$$ D_U(\tau) = \left( k_{u_1}(\tau), k_{u_2}(\tau), \ldots, k_{u_{U(\tau)}}(\tau) \right)^T , \quad D_M(\tau) = \left( k_{m_1}(\tau), k_{m_2}(\tau), \ldots, k_{m_{M(\tau)}}(\tau) \right)^T . \qquad (8) $$

These vectors have size $U(\tau)$ and $M(\tau)$, respectively.
We denote by $N_U(\tau, k)$ (with $k \in \{0, 1, \ldots, M(\tau)\}$) and
$N_M(\tau, k)$ (with $k \in \{0, 1, \ldots, U(\tau)\}$) the numbers of
users and videos, respectively, that have degree k at rating
time $\tau$. One can normalize $N_U(\tau, k)$ to obtain the
proportion of nodes with degree k, given by
$N_U(\tau, k) / U(\tau)$. An analogous relation holds for $N_M(\tau, k)$.

Based on our intuition about the choosing and rating
of videos, we add edges to the network using a combination
of linear preferential attachment and uniform attachment.
On one hand, one expects the choice of a user
to be driven in part by the choices made by others, as
popular videos are more likely to attract further viewings
and hence ratings. On the other hand, one also
expects an element of idiosyncrasy on the part of each
user, allowing him or her to choose any video from the
catalog regardless of the choices of others. This results
in two time-dependent probabilities (one for users and
one for videos), each of which consists of a convex combination
of preferential and uniform attachment. More
specifically, each time an edge is added to the network,
we select a user and a video to be connected by this new
edge. The video (user) node is chosen using uniform attachment
with probability 1 - q (respectively, 1 - p) and
linear preferential attachment with probability q (respectively,
p). The addition of an edge occurs during a single
discrete (rating) time step, as is common in models of
network evolution. Combining these ideas, a video node
with degree $k_i$ is chosen with probability

$$ P_M(\tau, k_i) = \frac{1 - q}{M(\tau) - N_M(\tau, U(\tau))} + \frac{q k_i}{\|D_M(\tau)\|_1 - U(\tau) N_M(\tau, U(\tau))} , \qquad (9) $$

and a user node with degree $h_i$ is chosen with probability

$$ P_U(\tau, h_i) = \frac{1 - p}{U(\tau) - N_U(\tau, M(\tau))} + \frac{p h_i}{\|D_U(\tau)\|_1 - M(\tau) N_U(\tau, M(\tau))} , \qquad (10) $$

where the values of the parameters $p, q \in [0, 1]$ are fixed,
$\|D_U(\tau)\|_1 = \sum_{i=1}^{U(\tau)} h_i$, and $\|D_M(\tau)\|_1 = \sum_{i=1}^{M(\tau)} k_i$.
The probabilities $P_U(\tau, h_i)$ and $P_M(\tau, k_i)$ change over
time as the degrees of the nodes change when edges are
added to the network.

The denominators in equations (9) and (10) contain the
terms $N_M(\tau, U(\tau))$ and $N_U(\tau, M(\tau))$ because once a node
of either type is fully connected, it is no longer eligible to
receive any new connections and is effectively no longer
in the catalog until a new node of the other type arrives.
When $\tau = 0$, one obtains $\|D_M(0)\|_1 = \|D_U(0)\|_1 = 0$
and $N_M(0, U(\tau)) = N_U(0, M(\tau)) = 0$, which would result
in division by zero. To overcome this problem,
we follow the standard procedure employed in network
growth models by seeding the algorithm with an edge
that connects two randomly chosen nodes (one from each
of the partite sets). This is equivalent to shifting the
rating-time variable and changing the initial conditions
to $\|D_M(0)\|_1 = \|D_U(0)\|_1 = 1$.
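The edge-placement rule can be sketched as a small simulation for fixed catalogs. The code below is an illustration under stated assumptions rather than the implementation used for the results in this paper: when the selected user and video are already connected, the pair is simply redrawn (the text only requires that edges join previously unconnected nodes), the uniform rule is used as a fallback if all eligible nodes have degree zero, and both mixing parameters are taken strictly below 1 so that every unconnected pair can eventually be drawn. The names `pick` and `simulate_fixed_catalog` are hypothetical.

```python
import random


def pick(degrees, cap, mix, rng):
    """Choose a node that is not yet fully connected (degree < cap).

    With probability 1 - mix the choice is uniform over eligible nodes;
    with probability mix it is proportional to degree among eligible nodes.
    """
    eligible = [i for i, k in enumerate(degrees) if k < cap]
    weights = [degrees[i] for i in eligible]
    if rng.random() < mix and sum(weights) > 0:
        return rng.choices(eligible, weights=weights)[0]
    return rng.choice(eligible)  # uniform branch (also the fallback for all-zero weights)


def simulate_fixed_catalog(n_users, n_videos, p, q, n_edges, seed=0):
    """Place n_edges distinct user-video edges (including one seed edge); needs p, q < 1."""
    rng = random.Random(seed)
    user_deg = [0] * n_users
    video_deg = [0] * n_videos
    u, m = rng.randrange(n_users), rng.randrange(n_videos)  # seed edge
    edges = {(u, m)}
    user_deg[u] += 1
    video_deg[m] += 1
    target = min(n_edges, n_users * n_videos)
    while len(edges) < target:
        u = pick(user_deg, n_videos, p, rng)
        m = pick(video_deg, n_users, q, rng)
        if (u, m) in edges:
            continue  # assumed rule: redraw when the pair is already connected
        edges.add((u, m))
        user_deg[u] += 1
        video_deg[m] += 1
    return user_deg, video_deg, edges


user_deg, video_deg, edges = simulate_fixed_catalog(10, 5, p=0.5, q=0.8, n_edges=50)
```

With 10 users and 5 videos, 50 edges saturate the network, so every user ends with degree 5 and every video with degree 10. Stopping earlier and averaging the video degrees over many seeds gives an empirical degree distribution at a chosen rating time.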



A. Rate Equations 

One can use rate equations (i.e., master equations) to
investigate the dynamics of the degree distributions of a
catalog network. This type of approach has been used
successfully to study a variety of other networks. The
analysis of the degree distribution
for videos in the catalog model is identical to the one
for users, as only the constants and sizes of the catalogs
are different. Accordingly, we present our results for the
degree distributions of the videos; one obtains the results
for user distributions by changing q to p, $M(\tau)$ to $U(\tau)$,
and $P_M(\tau, k)$ to $P_U(\tau, k)$. For notational convenience,
we also drop the subscripts in this subsection, so $N(\tau, k)$
denotes the number of nodes with degree k at time $\tau$.
To construct the rate equations, one must consider how
many nodes pass through $N(\tau, k)$ (i.e., turn into nodes
of degree k and k + 1) for $k \in \{0, 1, 2, \ldots, U(\tau)\}$. This
yields

$$ \frac{dN(\tau, 0)}{d\tau} = m'(\tau) - P_M(\tau, 0) N(\tau, 0) , $$
$$ \frac{dN(\tau, k)}{d\tau} = P_M(\tau, k-1) N(\tau, k-1) - P_M(\tau, k) N(\tau, k) , \quad k > 0 , \qquad (11) $$

where $m'(\tau) = \frac{dm(\tau)}{d\tau}$. The initial conditions are

$$ N(0, 0) = M(0) - 1 , \quad N(0, 1) = 1 , \quad N(0, k) = 0 , \; k > 1 . \qquad (12) $$

Equation (11) is a system of coupled nonlinear ordinary
differential equations (ODEs). The positive and negative
terms account, respectively, for an increase and decrease
in the number of nodes of a given degree as nodes receive
new edges. The equation for $N(\tau, 0)$ has $m'(\tau)$ as
a positive term to indicate the entry of new nodes (with
degree 0) to the network. The time-dependent probabilities
$P_M(\tau, k)$ are defined in equation (9). In the case
of fixed catalogs, there is a maximum value of k, so the
final equation in (11) takes a slightly different form (see
below).



1. Fixed Catalogs 

We begin by analyzing the evolution of the network
with fixed catalog sizes, so U(r) = U, M(r) = M, and
m'(r) = 0 for all r. Because a finite, fixed number of
users and videos are available in the catalogs, the network
can only evolve until time r = UM. At this point,
the system becomes saturated (i.e., $N_U(UM, M) = U$
and $N_M(UM, U) = M$), and no additional edges can be
added to the network. Note additionally that the equations
in (11) change slightly for fixed catalogs. In particular,
the last equation for nodes with degree U changes
to



$$
\frac{dN(r,U)}{dr} = P_M(r,U-1)\,N(r,U-1), \tag{13}
$$



which only has one positive term because nodes with de- 
gree U stay that way until the end of the process. 
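
As a quick consistency check (our own remark, not part of the original derivation), one can sum the rate equations over all degrees for a fixed catalog. The sketch below assumes only the structure of equations (11) and (13): each outflow term from degree k reappears as the inflow term for degree k + 1, so the sum telescopes.

```latex
\begin{align*}
\frac{d}{dr}\sum_{k=0}^{U} N(r,k)
  &= \bigl[m'(r) - P_M(r,0)\,N(r,0)\bigr]
   + \sum_{k=1}^{U-1}\bigl[P_M(r,k-1)\,N(r,k-1) - P_M(r,k)\,N(r,k)\bigr] \\
  &\quad + P_M(r,U-1)\,N(r,U-1) \\
  &= m'(r) = 0 .
\end{align*}
```

Hence $\sum_k N(r,k)$ stays at its initial value, the catalog size M, for all r: edges move nodes between degree classes, but they neither create nor destroy nodes.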



FIG. 10: (Color online) Degree distributions of video nodes
averaged over 500 simulations of a fixed catalog network with
U = 100 users, M = 30 videos, and q = 0.8 at rating times
r = 500 (red diamonds) and r = 1000 (blue squares). The
solid curves are the solutions to the differential equation (11).



Additionally, while the degree distribution of a network 
generated using the catalog model with static node sets 
is time-dependent, the long-time asymptotic behavior is 
always the same: 



$$
\lim_{r \to UM} N(r,k) =
\begin{cases}
M, & \text{if } k = U, \\
0, & \text{if } k < U,
\end{cases}
$$

which gives a de facto final condition to the system in
(11)–(13). Accordingly, we examine degree distributions
for r < UM - 1.

In Fig. 10, we show the degree distribution of the video
nodes averaged over 500 simulations of a fixed catalog
network with U = 100 and M = 30 at different times
during its evolution. As the discrete time r increases, the
peaks of the functions travel towards higher values of k
and decrease as if they were diffusing. We also observe a
jump in N(r, k) at k = U. This occurs because there are
nodes in the network that become fully connected during
the edge-assignment process (see Fig. 11). Interestingly,
Johnson et al. showed recently that the time-dependent
degree distributions observed in some networks that undergo
edge rewiring with preferential attachment follow
nonlinear diffusion processes [45].
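
To make the simulations behind these figures more concrete, the following is a minimal sketch of one run of a fixed catalog network. The attachment rules coded here are our reading of the model and should be treated as assumptions: the user is drawn from the users that are not yet saturated, the video from the videos that this user has not yet rated, and each draw is preferential (weight equal to the current degree) with probability p or q and uniform otherwise. The function names and the small sizes are hypothetical.

```python
import random

def simulate_fixed_catalog(U, M, p, q, steps, seed=0):
    """Add `steps` edges to an initially empty U-by-M bipartite network.

    ASSUMED rules (one reading of the catalog model, not taken verbatim
    from the paper): the user is drawn from the unsaturated users and the
    video from the videos that user has not rated yet; each draw is
    preferential (weight = current degree) with probability p or q and
    uniform otherwise.
    """
    rng = random.Random(seed)
    user_deg = [0] * U
    video_deg = [0] * M
    rated = [set() for _ in range(U)]  # videos already rated by each user

    def pick(candidates, degrees, pref_prob):
        weights = [degrees[c] for c in candidates]
        # Fall back to a uniform draw when all candidate degrees are zero.
        if rng.random() < pref_prob and sum(weights) > 0:
            return rng.choices(candidates, weights=weights, k=1)[0]
        return rng.choice(candidates)

    for _ in range(min(steps, U * M)):
        open_users = [u for u in range(U) if user_deg[u] < M]
        u = pick(open_users, user_deg, p)
        open_videos = [v for v in range(M) if v not in rated[u]]
        v = pick(open_videos, video_deg, q)
        rated[u].add(v)
        user_deg[u] += 1
        video_deg[v] += 1
    return user_deg, video_deg

def degree_counts(degrees, k_max):
    """Return the list N with N[k] = number of nodes of degree k."""
    counts = [0] * (k_max + 1)
    for d in degrees:
        counts[d] += 1
    return counts

# Small hypothetical example (much smaller than U = 100, M = 30).
U, M = 10, 4
user_deg, video_deg = simulate_fixed_catalog(U, M, p=0.5, q=0.8, steps=U * M)
N_video = degree_counts(video_deg, U)
```

Running the process for all UM steps must leave every video with degree U, which mirrors the long-time limit stated above; stopping earlier and averaging `degree_counts` over many seeds would give curves analogous to those in Fig. 10.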

Figure 12 reveals how the user nodes achieve full connectivity
between r = 0 and r = UM - 1. The image
shows the "paths" that user nodes follow in the (r, k)-plane
between (0, 0) and (UM - 1, M). For example,
the nodes that follow a steep (high k for early r) trajectory
are the ones that receive many links early on.
Their degree grows mostly from preferential attachment
in the edge-assignment mechanism, and they accordingly
achieve full connectivity early in the process. The nodes
that acquire edges more slowly initially begin to receive
edges very fast as r approaches UM (because other nodes
have already saturated), explaining the steep climb in the
upper right corner of the figure.

FIG. 11: (Color online) Numbers of nodes N(r, 0) with degree 0
(red triangles) and N(r, 100) with degree 100 (blue circles)
from 500 simulations of a fixed catalog network with U = 100,
M = 30, and p = 0.8. Inset: Decrease of N(r, 0) on a
semilogarithmic scale, which appears to be exponential.
The solid curves come from the solutions of (11).

FIG. 12: (Color online) Mean of N(r, k) for user nodes in 500
simulations of a fixed catalog network with U = 100, M = 30,
and p = 0.5. The axes are (rating) time r and degree k, and
the color indicates the value of log(N(r, k) + 1). The horizontal
line at the top of the image is the discontinuity (as seen with
the video nodes in Fig. 10) that corresponds to the value of
N(r, M) and reflects the appearance of fully connected user
nodes.

The "final" condition that N(UM - 1, U) = M makes
the system in (11) very stiff for high values of k and r.
Fig. 13 shows the path that the video nodes follow in the
(r, k)-plane (i.e., the same information as in Fig. 12 but
for video nodes) but for the numerical solutions of (11)
instead of direct network simulations. In the inset of
Fig. 13, we show the profile of N(r, U - 1), which evinces the
aforementioned stiffness. Because all nodes must be fully
connected at r = UM - 1, nodes with low degrees begin
to receive many edges for high values of r. This causes
N(r, k) for high k to peak late in the process, and the
nodes "travel" through values of k rather quickly, which
explains the incredibly steep slope of N(r, U - 1) as r
approaches UM - 1.

FIG. 13: (Color online) Numerical solution of N(r, k) for
video nodes from equation (11) with a fixed catalog and q =
0.8, M = 30, and U = 100. (We again plot log(N(r, k) + 1).)
The horizontal line at k = 100 corresponds to the saturated
nodes N(r, U). The inset shows a plot of N(r, U - 1) for the
same network.

The value of q affects the width of the (light-colored)
region of activity in the (r, k)-plane. For lower values of q (e.g.,
q = 0.3), uniform random attachment dominates and the
region of activity becomes narrower, as the nodes attain
edges at roughly the same pace. For larger values of q,
the first nodes to receive edges become more likely to continue
receiving more edges until they saturate, and the
area of activity of the nodes becomes wider (see Fig. 13).

2. Growing Catalogs 

In the previous section, we described the dynamics of
catalog networks when the sizes of the catalogs are fixed.
While this provides a good baseline investigation, catalogs
can grow in many applications; for example, Netflix
gains both new subscribers and new videos almost every
day. Accordingly, in this section we study the dynamics
of (11) for growing catalogs for which m'(r) > 0.

The system no longer has an obligatory final time,
and the saturation level of nodes is now time-dependent.
For example, a user that has degree M(r) is saturated
temporarily until a new video "arrives" (i.e., until time
r + Δr such that M(r + Δr) - M(r) > 0 and there is a new
video to rate).

FIG. 14: (Color online) Numerical solution of N(r, k) for
video nodes from equation (11) with q = 0.8, m(r) = 30 +
0.007r, and U = 100 + 0.05r. (We again plot log(N(r, k) + 1).)
The increasing diagonal line gives U(r) and represents the
temporarily saturated nodes. In the inset, we show a plot of
N(r, 0) on a semilogarithmic scale. We observe a rapid initial
decrease followed by a slower increase as the catalog grows.

In Fig. 14, we show a numerical solution to equation
(11) where m(r) and u(r) are linear functions of r. Instead
of the horizontal line of fully connected nodes along
k = 100 in Fig. 13, the saturation of the nodes follows
the growth of U(r). In the inset of Fig. 14, we show
the time profile of N(r, 0). Initially, it has what appears
to be exponential descent before it starts to grow slowly
as the catalog size increases, in contrast to what we observed
in Fig. 11. The early rapid decay is explained by
the absence of many nodes with high degrees, so nodes
with lower degrees receive edges. As r increases, the
better-connected nodes receive more edges (because for
q = 0.8 the dominant mechanism is linear preferential attachment)
and the population of nodes with fewer edges
increases slowly. In Section V, we discuss how the Netflix
data displays some of these features.



V. NETFLIX AS A CATALOG NETWORK 

We now investigate how well our catalog model captures
the human dynamics revealed by the Netflix data.
To do this, we sample the data set while keeping in mind 
the following considerations: 

• Because of the way we have defined our catalog network
growth model, we must consider the evolution
of the Netflix data in "rating time", in which every
new rating (which adds an edge in the network)
constitutes a time step.

• Although there might be a (physical) time difference
between a node (either user or video) joining
Netflix and the node receiving its first edge, this
information is not included in the data. Many videos
receive more than one rating on their first day, so
their entry to the network is reflected by increases
in the value of N(r, k) for several values of k. We
will have to take this into account when comparing
our model to the data.
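
The two considerations above can be sketched in code. The snippet below is a minimal illustration, assuming a hypothetical record format of `(day, user_id, video_id)` tuples sorted by day; it is not the processing pipeline used for the Netflix data. It counts every rating as one step of rating time and records the video degree histogram at the end of each day.

```python
from collections import Counter

def daily_degree_snapshots(ratings):
    """ratings: iterable of (day, user_id, video_id), sorted by day.

    Returns a list of (day, r, N) with r the cumulative number of ratings
    at the end of that day and N a Counter mapping degree k to the number
    of videos with exactly k ratings.
    """
    video_deg = Counter()
    snapshots = []
    r = 0
    current_day = None
    for day, _user, video in ratings:
        if current_day is not None and day != current_day:
            # The previous day is complete: store its snapshot.
            snapshots.append((current_day, r, Counter(video_deg.values())))
        current_day = day
        video_deg[video] += 1   # one new rating = one new edge
        r += 1                  # ... = one step of rating time
    if current_day is not None:
        snapshots.append((current_day, r, Counter(video_deg.values())))
    return snapshots

# Hypothetical toy records, not Netflix data.
toy = [(1, "u1", "A"), (1, "u2", "A"), (1, "u1", "B"),
       (2, "u3", "A"), (2, "u2", "C")]
snaps = daily_degree_snapshots(toy)
```

In the toy records, video "A" receives two ratings on its first day, so it first appears in the daily snapshot with degree 2 rather than 1. This is the effect described in the second consideration.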



A. Growth and Dynamics 

To compare our results to the data, we express the
growth of the numbers of videos and users as a function
of rating time r. Solving for t in equation (1) gives

$$
t = \frac{1}{b_r} \log\!\left( \frac{r}{a_r} + 1 \right). \tag{14}
$$

We substitute (14) into (2) to obtain the new expression
for the users as a function of ratings:

$$
u(r) = a_u \left[ \left( \frac{r}{a_r} + 1 \right)^{b_u / b_r} - 1 \right]. \tag{15}
$$

We follow the same procedure for the videos to obtain

$$
m(r) = a_m + \frac{b_m}{b_r} \log\!\left( \frac{r}{a_r} + 1 \right). \tag{16}
$$
FIG. 15: (Color online) Users (top) and videos (bottom) as
a function of ratings. We use circles to show the data from
Netflix and dashed curves to show the predictions from equations
(15) and (16). We use the parameter values obtained in
Sec. III.



In Fig. 15, we show the numbers of users and videos
versus the number of ratings in the network. Observe
that the predictions from equations (15)–(16) agree very
well with the data.
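
The change of variables from physical time to rating time can be written as a few short functions. This is a sketch under stated assumptions: we take the growth laws to have the forms implied by equations (14)–(16), namely exponential growth of ratings and users in t and linear growth of videos in t, and the numerical parameter values below are hypothetical placeholders rather than the fitted values from Sec. III.

```python
import math

# Hypothetical parameter values for illustration only; the fitted values
# from Sec. III are not reproduced here.
a_r, b_r = 1000.0, 0.004
a_u, b_u = 50.0, 0.003
a_m, b_m = 30.0, 0.5

def r_of_t(t):
    """Assumed form of equation (1): r(t) = a_r (exp(b_r t) - 1)."""
    return a_r * (math.exp(b_r * t) - 1.0)

def t_of_r(r):
    """Equation (14): physical time as a function of rating time."""
    return math.log(r / a_r + 1.0) / b_r

def u_of_r(r):
    """Equation (15): number of users as a function of rating time."""
    return a_u * ((r / a_r + 1.0) ** (b_u / b_r) - 1.0)

def m_of_r(r):
    """Equation (16): number of videos as a function of rating time."""
    return a_m + (b_m / b_r) * math.log(r / a_r + 1.0)
```

Under these assumptions, evaluating `u_of_r(r_of_t(t))` reproduces the assumed user growth law `a_u * (exp(b_u * t) - 1)`, and `m_of_r(r_of_t(t))` reproduces `a_m + b_m * t`, which is a convenient check on the algebra.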
FIG. 16: (Color online) Video degree distribution $N_{\text{data}}(r, k)$
in the Netflix data set in 2000. (We again plot
$\log(N_{\text{data}}(r, k) + 1)$.) We show data for videos with degrees
ranging from 1 to 4794.



Figure 16 shows the time-dependent degree distribution
of videos in the Netflix data set for the year 2000.
The sample in the plot consists of 365 measurements (one
for each day) of r and N(r, k). The highest degree in
this sample is 4794; this is well below the theoretical
maximum of 9289 according to the expression for u(r) in
equation (15), so the network is not experiencing node
saturation. We can rewrite the probability that a video
node receives an edge as

$$
P_M(r, k_i) = \frac{1 - q}{M(r)} + \frac{q\,k_i}{\|D_M(r)\|_1}.
$$

The rate equation for the evolution of the degree distribution
is

$$
\begin{aligned}
\frac{dN(r,1)}{dr} &= \delta_1 m'(r) - P_M(r,1)\,N(r,1), \\
\frac{dN(r,k)}{dr} &= \delta_k m'(r) + P_M(r,k-1)\,N(r,k-1) - P_M(r,k)\,N(r,k), \quad k > 1.
\end{aligned}
\tag{17}
$$



The initial conditions are N(0, 1) = m(0) and N(0, k) = 0
for k > 1. As noted earlier, the lowest degree a node
can have in the data is 1, and the entry degree of a
node can take any value of k. We denote by $\delta_k$ the
proportion of new nodes whose entry degree is k, such
that $\sum_k \delta_k = 1$. We investigated how many ratings
videos receive on the day that they enter the system
and found that over 97% of the new nodes receive 3 or
fewer ratings. Consequently, we set $\delta_1 = 0.8$, $\delta_2 =
0.15$, and $\delta_3 = 0.05$.
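
A numerical solution of equation (17) can be sketched with a simple forward-Euler scheme. This is an illustration under several assumptions that are ours and not the paper's: the catalog grows linearly, m(r) = m0 + growth * r; the edge probability is the mixture of uniform and linear preferential attachment, (1 - q)/M(r) + q k/||D_M(r)||_1; the total degree ||D_M(r)||_1 is computed from the state as the sum of k N(r, k); and the system is truncated at a hypothetical maximum degree `k_max` whose class only absorbs nodes. The solver used for the figures is not specified here, and a stiff ODE solver may be preferable in practice.

```python
def integrate_rate_equations(q, m0, growth, delta, k_max, r_end, dr=1.0):
    """Forward-Euler integration of a truncated version of equation (17).

    ASSUMPTIONS (for illustration): the catalog grows linearly,
    m(r) = m0 + growth * r; the total degree ||D_M(r)||_1 is computed
    from the state as sum_k k N(r, k); and the top class k_max only
    absorbs nodes, so no mass leaves the truncated system.
    """
    N = [0.0] * (k_max + 1)      # N[k] for k = 0..k_max; N[0] is unused
    N[1] = float(m0)             # initial condition N(0, 1) = m(0)
    r = 0.0
    while r < r_end - 1e-12:
        m = m0 + growth * r
        total_degree = sum(k * N[k] for k in range(1, k_max + 1))
        # P_M(r, k): uniform part plus preferential part
        P = [(1.0 - q) / m + q * k / total_degree for k in range(k_max + 1)]
        dN = [0.0] * (k_max + 1)
        for k in range(1, k_max + 1):
            inflow = P[k - 1] * N[k - 1] if k > 1 else 0.0
            outflow = P[k] * N[k] if k < k_max else 0.0
            # delta_k m'(r) is the entry of new nodes with degree k
            dN[k] = delta.get(k, 0.0) * growth + inflow - outflow
        N = [n + dr * d for n, d in zip(N, dN)]
        r += dr
    return N

# Entry-degree proportions from the text; other values are hypothetical.
delta = {1: 0.8, 2: 0.15, 3: 0.05}
N_end = integrate_rate_equations(q=0.9795, m0=30, growth=0.007,
                                 delta=delta, k_max=400, r_end=2000)
```

Because the entry proportions sum to 1 and the top class is absorbing, the total number of nodes in `N_end` should equal m(r_end) = m0 + growth * r_end, which gives a simple check that the discretization conserves nodes.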

To see how well our model describes the Netflix video
data in the year 2000, we define $N_k(q)$ as the 4794 × 365
matrix obtained by solving the system in (17) and $N_{\text{data}}$
as the matrix obtained from the data sample. These two
matrices contain the values of N(r, k) from the sample and from the
equations for all values of k and r. The matrices are of
the given size because we sample the degree distribution
once per day and the maximum degree observed is 4794.
We define the error function

$$
E(q) = \| N_k(q) - N_{\text{data}} \|, \tag{18}
$$

where $\|\cdot\|$ is the Euclidean matrix norm. To find the
optimum value q*, we minimize E(q) using the Nelder-Mead
derivative-free simplex method [46]. We found that
the value of q that minimizes (18) is q* ≈ 0.9795, meaning
that according to the model about 98% of the decisions
to rate a video by users are guided by its popularity (i.e.,
preferential attachment).
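
The fitting step can be illustrated on synthetic data. The sketch below makes several substitutions that should be read as assumptions: `toy_model` is a hypothetical stand-in for the matrix obtained by solving (17), the norm is taken to be the entrywise (Frobenius) norm, and a golden-section search replaces the Nelder-Mead method because the problem has a single parameter on [0, 1]. It shows only the structure of the computation, not the actual fit.

```python
import math

def frobenius_distance(A, B):
    """Entrywise Euclidean (Frobenius) norm of A - B for equal-size matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def toy_model(q, n_k=5, n_r=4):
    """Hypothetical stand-in for the matrix N_k(q) obtained from (17)."""
    return [[(1.0 - q) * (r + 1) + q * (r + 1) / (k + 1) for r in range(n_r)]
            for k in range(n_k)]

def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Minimize a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            # The minimum lies in [lo, d]: shrink from the right.
            hi, d = d, c
            c = hi - inv_phi * (hi - lo)
        else:
            # The minimum lies in [c, hi]: shrink from the left.
            lo, c = c, d
            d = lo + inv_phi * (hi - lo)
    return 0.5 * (lo + hi)

N_data = toy_model(0.9795)                      # synthetic "data"
E = lambda q: frobenius_distance(toy_model(q), N_data)
q_star = golden_section_minimize(E, 0.0, 1.0)
```

Since the synthetic data are generated by the same toy model, the search should recover the value of q used to generate them; with real data, E(q*) would be positive and would measure the residual mismatch between model and sample.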




FIG. 17: (Color online) Values of N(r, 10) (videos with degree
10) obtained by solving (17) using q = 0.9795 (red curve) and
the data from Netflix that we report in Fig. 16 (blue dots).



In Figs. 17 and 18, we compare the values of N(r, k)
that we obtained in our model to those in the data. In
spite of the noise in the data, our model is able to reproduce
the temporal dynamics of N(r, k).

In Fig. 19, we show the approximation of our model
to the cumulative degree distribution of the videos on
the last day of the sample (i.e., for all values of k and
r = 915628, the number of ratings at the end of year
2000), which agrees very well with the data.

Although q* ≈ 0.9795 suggests that the way the users
choose to rate videos is dominated by the popularity of 
the films, we should stress that the model developed here 
is a very simple one. There are probably many other 
processes influencing the decisions of the users, including 
different external (to the user) factors, such as advertise- 
ments, press, and the underlying social network the users 
are embedded in. 
FIG. 18: (Color online) Values of N(r, 50) (videos with degree
50) obtained by solving (17) using q = 0.9795 (red curve) and
the data from Netflix that we report in Fig. 16 (blue dots).

FIG. 19: (Color online) Cumulative degree distribution of
video nodes on the last day (915628 ratings) of the sample
from year 2000. We obtained this by solving (17) using q =
0.9795 (red curve) and directly from the data (blue dots).



VI. CONCLUSIONS 

We have analyzed a large network of video ratings 
given by the users of the Netflix video rental service. We 
studied the system using a bipartite network of videos 
and users and employed this perspective to reveal inter- 
esting features in the dynamics of video rating, such as 
weekly patterns in video ratings and bursts of activity 
followed by long idle periods. We calculated clustering 
coefficients for one-day snapshots, concluding that their 
low values arise from the presence of high-degree nodes 
(i.e., videos with a large number of ratings and users who 



rate many videos). We also showed that the degree distri- 
butions of both the user and video nodes resemble power 
laws with exponential cutoffs. 

Motivated by the structural and dynamical features 
we observed in the Netflix data, we formulated a mech- 
anism of network evolution in the form of "catalog net- 
works" for bipartite systems. Such networks are initially 
empty (aside from a seed), and edges are created between 
two types of nodes based on some predefined rules. New 
nodes can also be added to the network during the wiring 
process. In our model, we considered a combination of 
uniform random attachment and linear preferential at- 
tachment. We derived a set of coupled ordinary differ- 
ential equations that describe the time-evolution of the 
degree distributions of such catalog networks. Presup- 
posing this mechanism and employing the Netflix data, 
we found that users seem to choose videos according to 
preferential attachment about 98% of the time and uni- 
form attachment about 2% of the time. This suggests 
that the number of ratings for a given video is driven 
almost completely by its popularity (preferential attach- 
ment) and only in very small measure by the intrinsic 
preferences of users. While interesting, the extreme dom- 
inance of a preferential-attachment mechanism might be 
due in part to the simplicity of our model and the ab- 
sence of information about the underlying social network 
of the users, which can have considerable influence over 
the video choices. Additionally, our model does not incorporate
external influences such as media coverage and
promotion campaigns that can certainly affect the popu- 
larity of videos. One can refine such insights by consider- 
ing more sophisticated attachment mechanisms that in- 
corporate the actual scores of the video ratings (not just 
their existence), the age of the videos, user social networks
(see Refs. [47] and [48] for recent interesting studies),
interactions among users, media presence of videos, and 
more. Our simple catalog model thereby serves as a good 
starting point for an abundance of interesting generaliza- 
tions. 

The Netflix data, which is both large and publicly 
available, provides an excellent vehicle to study many of 
the features that have been observed in network repre- 
sentations of systems in which agents exercise preferences 
or choices, such as citation, collaboration, and social net- 
works @,[!im|3il!! HI- In this paper, we formulated 
a catalog model to understand the human dynamics of 
video rating. In our view, catalog models are suitable 
in many other contexts, including studying certain elec- 
toral systems (such as the preamble to preferential voting 
elections) [42|, professional sports drafts [Hi]], and retail 
shopping. To achieve insights in such a diverse array 
of settings, the catalog model presented herein can be 
generalized in numerous interesting ways to incorporate 
external agents, underlying networks or cliques of indi- 
viduals, and more. 